In [1]:
%pylab inline
%qtconsole
In [10]:
import sklearn
from sklearn import cluster,datasets
In [7]:
N=10**4
D_gen=2
K_gen=10
In [27]:
X,A_true=sklearn.datasets.make_blobs(n_samples=N, n_features=D_gen, centers=K_gen, cluster_std=1.0, center_box=(-100.0, 100.0))
In [12]:
K=10
N_inits=500
max_iter=20
In [17]:
%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
precompute_distances=True, n_init=N_inits,
max_iter=max_iter, random_state=None, copy_x=True,
n_jobs=1)
The documentation claims that precompute_distances (True by default) will consume more memory but make the algorithm faster. For our goal (running K-Means many times with a small number of iterations) this is not the case, as the results below show. The documentation does not explain how this parameter affects the computation under the hood.
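To make the memory/speed trade-off concrete, the snippet below is a rough sketch of the idea behind precomputing distances (it is not scikit-learn's actual implementation, and assign_labels is just an illustrative name): the squared norms of the points are computed once and reused in every assignment step, since $$\|x-c\|^2 = \|x\|^2 - 2\,x\cdot c + \|c\|^2.$$
# Hedged sketch: reuse precomputed ||x||^2 in every assignment step.
# This is NOT scikit-learn's internal code, just the underlying idea.
x_squared_norms = (X ** 2).sum(axis=1)   # computed once, reused across runs
def assign_labels(X, centers, x_squared_norms):
    c_squared_norms = (centers ** 2).sum(axis=1)
    d2 = (x_squared_norms[:, None]
          - 2.0 * X.dot(centers.T)
          + c_squared_norms[None, :])    # (N, K) squared distances
    return d2.argmin(axis=1)             # nearest centroid per point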
In [18]:
%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
precompute_distances=False, n_init=N_inits,
max_iter=max_iter, random_state=None, copy_x=True,
n_jobs=1)
The K-Means function from scikit-learn already supports CPU parallelization through a single parameter. As expected, using multiple cores resulted in a speedup, albeit a small one. The parallelization is applied to the pairwise distance computation: the pairwise distance matrix is split across the number of jobs selected.
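As a hedged illustration of that splitting (not scikit-learn's internal code; parallel_pairwise is just an illustrative name), each job could compute the distance block for one chunk of rows and the blocks are then stacked back together:
# Hedged sketch of splitting a pairwise distance computation across jobs.
from sklearn.externals.joblib import Parallel, delayed  # or the standalone joblib package
from sklearn.metrics.pairwise import euclidean_distances

def parallel_pairwise(X, centers, n_jobs=2):
    chunks = np.array_split(X, n_jobs)           # one row block per job
    blocks = Parallel(n_jobs=n_jobs)(
        delayed(euclidean_distances)(chunk, centers) for chunk in chunks)
    return np.vstack(blocks)                     # full (N, K) distance matrix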
In [21]:
%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
precompute_distances=True, n_init=N_inits,
max_iter=max_iter, random_state=None, copy_x=True,
n_jobs=-1)
In [24]:
%%timeit
centroids,A_clust,c,=sklearn.cluster.k_means(X, K, init='random',
precompute_distances=False, n_init=N_inits,
max_iter=max_iter, random_state=None, copy_x=True,
n_jobs=-1)
It should be noted that this implementation already appears to be accelerated by compiled C code.
These are the times from the code above before the installation of the MKL-optimized scikit-learn (from the Anaconda Accelerate add-on).
1 loops, best of 3: 5.67 s per loop
1 loops, best of 3: 4.05 s per loop
1 loops, best of 3: 3.29 s per loop
1 loops, best of 3: 2.57 s per loop
In [2]:
import numbapro
from numbapro import jit, cuda, int32, float32, float64
#@jit(complex64(int32, float32, complex64), target="cpu")
In [3]:
numbapro.check_cuda()
Out[3]:
In [5]:
import math  # math.sin/math.cos are used inside the vectorized kernels

#CPU version
@numbapro.vectorize(['float32(float32,float32)',
                     'float64(float64,float64)'], target='cpu')
def cpu_sincos(x, y):
    return math.sin(x) * math.cos(y)

#GPU version
@numbapro.vectorize(['float32(float32,float32)',
                     'float64(float64,float64)'], target='gpu')
def gpu_sincos(x, y):
    return math.sin(x) * math.cos(y)
In [6]:
n=10**6
x = np.linspace(0,np.pi,n)
y = np.linspace(0,np.pi,n)
print 'Data of type ',x.dtype
print 'Unoptimized\t',
%timeit np.sin(x) * np.cos(y)
print 'CPU vectorize\t',
%timeit cpu_sincos(x,y)
print 'GPU vectorize\t',
%timeit gpu_sincos(x,y)
del x,y
Unoptimized and CPU-vectorized times are similar because the sin() and cos() calls dominate the runtime. The GPU is already considerably faster and would be faster still on a better GPU (the current GPU has only 48 CUDA cores).
All the data has to travel from RAM to GPU memory and back again. The work is distributed across all threads on the GPU, and scheduling and memory management are taken care of automatically.
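If the same ufunc is called repeatedly on the same data, those host-to-device copies can be paid only once by passing device arrays explicitly. The sketch below is my own and assumes the CUDA ufunc accepts device arrays and an out= argument (as the gufunc call further down does); dx, dy and dout are illustrative names.
# Hedged sketch: keep the data on the GPU between repeated calls to gpu_sincos.
x = np.linspace(0, np.pi, 10**6)
y = np.linspace(0, np.pi, 10**6)
dx = numbapro.cuda.to_device(x)                # copy inputs to device memory once
dy = numbapro.cuda.to_device(y)
dout = numbapro.cuda.device_array_like(x)      # pre-allocated device output
gpu_sincos(dx, dy, out=dout)                   # computation stays on the GPU
result = dout.copy_to_host()                   # copy back only when needed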
In [39]:
@numbapro.vectorize(['float32(float32,float32,float32,float32)'])
def cpu_powers(p, q, r, s):
    return math.sqrt(p**2 + q**3 + r**4 + s**5)

@numbapro.vectorize(['float32(float32,float32,float32,float32)'], target='gpu')
def gpu_powers(p, q, r, s):
    return math.sqrt(p**2 + q**3 + r**4 + s**5)
When target is not specified, it defaults to 'cpu'.
In [40]:
## Benchmarking
#Generate data
n=5e6
p = np.random.random(n).astype(np.float32)
q = np.random.random(n).astype(np.float32)
r = np.random.random(n).astype(np.float32)
s = np.random.random(n).astype(np.float32)
print 'Unoptimized\t',
%timeit np.sqrt(p**2 + q**3 + r**4 + s**5)
print 'CPU vectorize\t',
%timeit cpu_powers(p,q,r,s)
print 'GPU vectorize\t',
%timeit gpu_powers(p,q,r,s)
del p,q,r,s
Anaconda Accelerate includes MKL optimizations for several libraries. I still have to check whether the MKL-optimized NumPy routines are faster than the standard ones.
The speedup of the GPU over the CPU is $$2.1\,\mathrm{s} \,/\, 69.4\,\mathrm{ms} \approx 30.26$$
We've now seen that @vectorize accelerates functions, especially on the GPU, but it only works on scalar functions. Even though more complex operations like matrix multiplication can be decomposed into scalar operations over their elements, it would be more convenient to operate on arrays directly.
In [7]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m,n),(n,p)->(m,p)', target='gpu')
def batch_matrix_mult(a, b, c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp = 0
            for n in range(a.shape[1]):
                tmp += a[i,n] * b[n,j]
            c[i,j] = tmp
#batch_matrix_mult.max_blocksize = 32 # limits to 32 threads per block
This is the code from the video. The kernel itself is a plain matrix multiplication; in the video they call it a batch multiplication because the gufunc signature '(m,n),(n,p)->(m,p)' makes it loop over the leading (batch) dimension when 3-D arrays are passed in.
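A quick sanity check of that broadcasting behaviour (my own addition, not from the video): with stacked 2x2 matrices the gufunc result should match numpy.core.umath_tests.matrix_multiply.
# Hedged sanity check: the '(m,n),(n,p)->(m,p)' signature applied to 3-D
# inputs broadcasts over the leading (batch) dimension.
import numpy.core.umath_tests as ut
a_small = np.random.random((5, 2, 2)).astype(np.float32)
b_small = np.random.random((5, 2, 2)).astype(np.float32)
c_small = batch_matrix_mult(a_small, b_small)
print np.allclose(c_small, ut.matrix_multiply(a_small, b_small), atol=1e-5)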
In [9]:
n=4e6
dim=2
print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'
a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
print 'Exception when too many resources of the GPU are used:'
try:
    batch_matrix_mult(a, b)
except Exception as exception:
    print exception
batch_matrix_mult.max_blocksize = 64 # limits to 64 threads per block
import numpy.core.umath_tests as ut
print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)
print 'GUVectorize GPU\t',
%timeit c2 = batch_matrix_mult(a,b)
del n,dim,a,b
From the documentation:
There are times when the gufunc kernel uses too many of a GPU’s resources, which can cause the kernel launch to fail. The user can explicitly control the maximum size of the thread block by setting the max_blocksize attribute on the compiled gufunc object.
To control the size of the thread block we use (e.g. for 32 threads per block):
very_complex_kernel.max_blocksize = 32
The resulting speedup is measured below, comparing automatic and manual GPU memory management against the CPU baseline.
In [11]:
n=4e6
dim=2
print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'
a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
dc = numbapro.cuda.device_array_like(a)
da = numbapro.cuda.to_device(a)
db = numbapro.cuda.to_device(b)
def check_pure_compute_time(da, db, dc):
    batch_matrix_mult(da, db, out=dc)
    numbapro.cuda.synchronize()
import numpy.core.umath_tests as ut
print 'Unoptimized\t',
#c = np.dot(a,b)
%timeit c = ut.matrix_multiply(a,b)
print 'Memory management'
batch_matrix_mult.max_blocksize = 64 # limits to 64 threads per block
print 'Automatic:\t\t',
%timeit c=batch_matrix_mult(a,b)
print 'Manual:\t\t\t',
%timeit check_pure_compute_time(da,db,dc)
del n,dim,a,b,da,db,dc
Kernel function
A grid is a group of blocks (which can be 1D, 2D or 3D) and each block is a group of threads (which can also be 1D, 2D or 3D). Every thread executes the same kernel function.
When we use @cuda.jit we are writing functions directly against the CUDA execution model. The GPU functions we wrote earlier had a return type, which is not allowed here; that is because earlier we were using @vectorize, which hides the CUDA execution model from us.
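As a minimal sketch of that index arithmetic (my own example, not from the course material; scale_by_two is just an illustrative name), the kernel below recovers each thread's global (x, y) position with numbapro.cuda.grid(2), and the launch configuration uses enough 16x16 blocks to cover the whole array.
# Hedged sketch: map the 2-D grid of threads onto a 2-D array.
@numbapro.cuda.jit("void(float32[:,:])")
def scale_by_two(arr):
    x, y = numbapro.cuda.grid(2)       # global thread coordinates
    if x < arr.shape[0] and y < arr.shape[1]:
        arr[x, y] *= 2.0               # guard against excess threads

data2d = np.random.random((100, 100)).astype(np.float32)
blockdim = (16, 16)                    # threads per block
griddim = (int(np.ceil(data2d.shape[0] / 16.0)),
           int(np.ceil(data2d.shape[1] / 16.0)))
d_arr = numbapro.cuda.to_device(data2d)
scale_by_two[griddim, blockdim](d_arr)
d_arr.copy_to_host(ary=data2d)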
In [12]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m,n),(n,p)->(m,p)', target='gpu')
def matrix_mult(a, b, c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp = 0
            for n in range(a.shape[1]):
                tmp += a[i,n] * b[n,j]
            c[i,j] = tmp
#batch_matrix_mult.max_blocksize = 32 # limits to 32 threads per block
In [13]:
n=4e6
dim=2
print 'Memory for both matrices:\t',((2*n*dim*dim*4)/1024),' KBytes'
a=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
b=np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim)
dc = numbapro.cuda.device_array_like(a)
da = numbapro.cuda.to_device(a)
db = numbapro.cuda.to_device(b)
def check_pure_compute_time(da, db, dc):
    batch_matrix_mult(da, db, out=dc)
    numbapro.cuda.synchronize()
del a,b,da,db,dc
In [47]:
tpb=10e6 / (10*10)
print tpb
print np.sqrt(tpb)
In [46]:
from numbapro import *
from timeit import default_timer as timer
m=1000
n=1000
A = np.array(np.random.random((n,m)), dtype=np.float32)
C = np.empty([n,n], dtype=np.float32)  # float32 to match the kernel signature
B = np.array(np.random.random((m,n)), dtype=np.float32)
%timeit X=np.dot(A,B)
@cuda.jit(void(float32[:,:], float32[:,:], float32[:,:]))
def cu_square_matrix_mul(A, B, C):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    x = tx + bx * bw
    y = ty + by * bh
    n = C.shape[0]
    if x >= n or y >= n:
        return
    cs = 0
    for i in range(n):
        cs += A[y,i] * B[i,x]
    C[y,x] = cs
    cuda.syncthreads()

def check_pure_compute_time(dA, dB, dC):
    stream = cuda.stream()
    blockdim = 10,10
    griddim = 10,10
    cu_square_matrix_mul[griddim, blockdim, stream](dA, dB, dC)
    dC.to_host()
    stream.synchronize()
stream = cuda.stream()
dA = cuda.to_device(A, stream)
dC = cuda.to_device(C, stream)
B = np.array(np.random.random((m,n)), dtype=np.float32)
dB = cuda.to_device(B, stream)
check_pure_compute_time(dA,dB,dC)
"""
print
print("Numpy took %f seconds" % numpy_time)
print("CUDA JIT took %f seconds, %.5fx speedup" % (cuda_time, numpy_time / cuda_time))"""
In [25]:
In [29]:
my_gpu = numbapro.cuda.get_current_device()
print "Running on GPU:", my_gpu.name
cores_per_capability = {1: 8, 2: 32, 3: 192}
cc = my_gpu.compute_capability
print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
majorcc = cc[0]
print "Number of streaming multiprocessors:", my_gpu.MULTIPROCESSOR_COUNT
cores_per_multiprocessor = cores_per_capability[majorcc]
print "Number of cores per multiprocessor:", cores_per_multiprocessor
total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
print "Number of cores on GPU:", total_cores
In [19]:
%qtconsole
In [2]:
import numbapro
from numbapro import *
In [88]:
def dist(a, b, c):
    # Naive CPU version: squared Euclidean distance from every point in a
    # to every cluster centre in b, written into the (N, K) array c.
    N, K = c.shape
    dim = a.shape[1]
    for n in range(N):
        for k in range(K):
            dist = 0
            for d in range(dim):
                diff = a[n,d] - b[k,d]
                dist += np.power(diff, 2)
            c[n,k] = dist
In [ ]:
def dist2(a, b, c):
    # Vectorized CPU version: one NumPy broadcast per cluster centre.
    N, K = c.shape
    dim = a.shape[1]
    for k in range(K):
        dist = a - b[k]
        dist = dist ** 2
        c[:,k] = dist.sum(axis=1)
In [23]:
@numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m,n),(p,n)->(m,p)', target='gpu')
def guvect_dist(a, b, c):
    N, K = c.shape
    dim = a.shape[1]
    for n in range(N):
        for k in range(K):
            dist = 0
            for d in range(dim):
                diff = a[n,d] - b[k,d]
                #dist += np.power(diff,2)
                dist += diff ** 2
            c[n,k] = dist
In [179]:
@numbapro.cuda.jit("void(float32[:,:], float32[:,:], float64[:,:])")
def cu_dist(a, b, c):
    """
    # thread ID inside block
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    # block ID
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # block dimensions
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    # compute thread's x and y index (i.e. datapoint and cluster)
    n = tx + bx * bw # row for each datapoint
    k = ty + by * bh # column for each cluster
    """
    n, k = numbapro.cuda.grid(2)
    ch, cw = c.shape # c height (datapoints) and width (clusters)
    if n >= ch or k >= cw:
        return
    dist = 0.0
    for d in range(a.shape[1]):
        diff = a[n,d] - b[k,d]
        dist += diff ** 2
    c[n,k] = dist
    #cuda.syncthreads()

def cu_dist_wrap(a, b, c, blockDim, gridDim):
    dA = cuda.to_device(a)
    dB = cuda.to_device(b)
    #dC = cuda.to_device(c)
    #dC = numbapro.cuda.device_array(c)
    dC = numbapro.cuda.device_array_like(c)
    #stream = cuda.stream()
    cu_dist[gridDim, blockDim](dA, dB, dC)
    #dC.to_host()
    #stream.synchronize() # block until stream is done
    dC.copy_to_host(ary=c)
In [171]:
#%debug
##generate data
n = 1e4
d = 2
k = 20
total_bytes = (n * d + k * d + n * k) * 4
print 'Memory used by arrays:\t',total_bytes/1024,'\tKBytes'
print '\t\t\t',total_bytes/(1024*1024),'\tMBytes'
data = np.random.random((n,d)).astype(np.float32)
clusters = np.random.random((k,d)).astype(np.float32)
#dists = np.empty((n,k)).astype(np.float32)
distsDim = np.int(n),np.int(k)
blockDim = 8,6 # GT520M has 48 CUDA cores
numBlocks = np.ceil((n * k) / 48)
gridSquareSide = np.int(np.ceil(np.sqrt(numBlocks)))
gridDim = gridSquareSide,gridSquareSide
print 'Excess threads:\t',
print np.prod(gridDim)*np.prod(blockDim) - (n * k)
print '\n','Shape of dists:',distsDim
In [172]:
distsCUDA = np.empty(distsDim)
distsCPU = np.empty(distsDim)
dist2(data,clusters,distsCPU)
cu_dist_wrap(data,clusters,distsCUDA,blockDim,gridDim)
In [173]:
#%timeit -r 1 dist(data,clusters,dists)
#%timeit -r 1 guvect_dist(data,clusters,dists)
%timeit cu_dist_wrap(data,clusters,distsCUDA,blockDim,gridDim)
#print type(dists)
#del data,clusters,dists
In [85]:
speedup=2.92 / 2.58e-3
print speedup